Cluster analysis continued

Wine data set

Standardize the features

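Check that the standardization was "successful": every standardized feature should have an average of (approximately) 0 and a standard deviation of (approximately) 1.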

Standardization first CENTERS and then SCALES each feature. To CENTER is to subtract the SAMPLE AVERAGE, and to SCALE is to divide by the SAMPLE STANDARD DEVIATION.

If $x_{n,d}$ is the value of variable $d$ for observation $n$, and $x_{:,d}$ collects all observations of variable $d$, the standardized value is:

$$ \tilde{x}_{n,d} = \frac{x_{n,d} - \mathrm{mean}\left(x_{:,d}\right)}{\mathrm{std}\left(x_{:,d}\right)} $$
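The formula can be applied directly with NumPy and then checked. This is a minimal sketch, assuming scikit-learn's bundled `load_wine` data as a stand-in for the notebook's data file. It also compares the result with `StandardScaler`, which divides by the population standard deviation (`ddof=0`) rather than the sample standard deviation, so the two results differ by a constant factor.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled wine data stands in for the notebook's file.
X = load_wine().data  # one row per wine, one column per feature

# CENTER with the sample average, SCALE with the sample standard deviation (ddof=1).
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Check: every column should have average ~0 and sample standard deviation ~1.
col_means = X_std.mean(axis=0)
col_sds = X_std.std(axis=0, ddof=1)
print("averages all ~0:", np.allclose(col_means, 0.0))
print("standard deviations all ~1:", np.allclose(col_sds, 1.0))

# StandardScaler uses ddof=0, so its output equals X_std times sqrt(n / (n - 1)).
X_skl = StandardScaler().fit_transform(X)
n = X.shape[0]
ratio = np.sqrt(n / (n - 1))
print("StandardScaler matches up to the ddof factor:", np.allclose(X_skl, X_std * ratio))
```

For a data set of this size the factor is very close to 1, so either version is reasonable as input for PCA and clustering.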

Use PCA to help with visualization

Show the known Cultivar group on the PC scatter plot by coloring the markers.
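One way to build such a plot is sketched below. It assumes the `load_wine` data from scikit-learn, uses its `target` column as the Cultivar group, and the column names `pc01`, `pc02`, and `Cultivar` are illustrative choices rather than the notebook's own.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled wine data stands in for the notebook's file.
wine = load_wine()
X_std = StandardScaler().fit_transform(wine.data)

# Project the standardized features onto the first two principal components.
pca = PCA(n_components=2)
pcs = pca.fit_transform(X_std)

pc_df = pd.DataFrame(pcs, columns=["pc01", "pc02"])
pc_df["Cultivar"] = wine.target + 1  # illustrative labels 1, 2, 3

# One scatter call per Cultivar so each group gets its own color and legend entry.
fig, ax = plt.subplots()
for cultivar, grp in pc_df.groupby("Cultivar"):
    ax.scatter(grp["pc01"], grp["pc02"], label=f"Cultivar {cultivar}", alpha=0.7)
ax.set_xlabel("pc01")
ax.set_ylabel("pc02")
ax.legend()

print("variance explained by 2 PCs:", pca.explained_variance_ratio_.sum())
```

The PCs are only used for visualization here. The clustering itself can still be run on all standardized features.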

KMeans

Find the optimal number of clusters using the Silhouette Coefficient.
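A sketch of the search follows, again assuming the `load_wine` data. The range of cluster counts, `n_init`, and `random_state` are arbitrary choices for illustration. The average silhouette coefficient is computed for each candidate number of clusters, and the largest value points to the preferred number.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled wine data stands in for the notebook's file.
X_std = StandardScaler().fit_transform(load_wine().data)

# The silhouette coefficient is undefined for 1 cluster, so start at 2.
sil = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=25, random_state=121).fit_predict(X_std)
    sil[k] = silhouette_score(X_std, labels)

for k, score in sil.items():
    print(f"k={k}: average silhouette = {score:.3f}")

best_k = max(sil, key=sil.get)
print("largest average silhouette at k =", best_k)
```

The silhouette coefficient ranges from -1 to 1, with larger values indicating tighter, better separated clusters.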

Hierarchical clustering

We still want to use the standardized features.

Visualize the hierarchical cluster results via a DENDROGRAM.

Single linkage.

Average linkage.

Centroid linkage.

Ward linkage.
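The four linkage methods can be compared side by side. The sketch below assumes SciPy's `scipy.cluster.hierarchy` module and the `load_wine` data. The 2x2 figure layout is an illustrative choice.

```python
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled wine data stands in for the notebook's file.
X_std = StandardScaler().fit_transform(load_wine().data)

# Build one linkage matrix per method, all from the standardized features.
methods = ["single", "average", "centroid", "ward"]
linkages = {m: hierarchy.linkage(X_std, method=m) for m in methods}

# Draw one dendrogram per method. Leaf labels are hidden because there are many wines.
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for ax, m in zip(axes.ravel(), methods):
    hierarchy.dendrogram(linkages[m], no_labels=True, ax=ax)
    ax.set_title(f"{m} linkage")
fig.tight_layout()
```

Each linkage matrix has one row per merge, so it has one fewer row than there are observations.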

Cut the tree to create the clusters

The cluster labels come back as a 2D array with a single column, so we need to convert that 2D array into a 1D array.
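A minimal sketch of the cut, assuming SciPy's `cut_tree` function, the Ward linkage, and 3 clusters (both are illustrative choices here):

```python
import numpy as np
from scipy.cluster import hierarchy
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled wine data stands in for the notebook's file.
X_std = StandardScaler().fit_transform(load_wine().data)
Z = hierarchy.linkage(X_std, method="ward")

# cut_tree returns a 2D array: one row per observation, one column per requested cut.
cut = hierarchy.cut_tree(Z, n_clusters=3)
print("cut_tree output shape:", cut.shape)

# Flatten to a 1D array so the labels can be stored as a single DataFrame column.
labels = cut.ravel()
print("flattened shape:", labels.shape)
print("observations per cluster:", np.bincount(labels))
```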

Use a heat map to look at the cross-tabulation between the cluster assignment and the known Cultivar group.
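The cross-tabulation can be computed with `pandas.crosstab` and drawn as an annotated heat map. This sketch uses plain matplotlib and assumes the `load_wine` data and a 3-cluster Ward solution. The column names `hclust` and `Cultivar` are illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd
from scipy.cluster import hierarchy
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled wine data stands in for the notebook's file.
wine = load_wine()
X_std = StandardScaler().fit_transform(wine.data)
Z = hierarchy.linkage(X_std, method="ward")

df = pd.DataFrame({
    "Cultivar": wine.target + 1,                            # illustrative labels 1, 2, 3
    "hclust": hierarchy.cut_tree(Z, n_clusters=3).ravel(),  # 1D cluster labels
})

# Rows are cluster assignments, columns are the known Cultivar groups.
counts = pd.crosstab(df["hclust"], df["Cultivar"])
print(counts)

# Heat map with the count written inside each cell.
fig, ax = plt.subplots()
im = ax.imshow(counts.to_numpy(), cmap="Blues")
ax.set_xticks(list(range(counts.shape[1])))
ax.set_xticklabels([str(c) for c in counts.columns])
ax.set_yticks(list(range(counts.shape[0])))
ax.set_yticklabels([str(r) for r in counts.index])
ax.set_xlabel("Cultivar")
ax.set_ylabel("hclust")
for i in range(counts.shape[0]):
    for j in range(counts.shape[1]):
        ax.text(j, i, str(counts.iat[i, j]), ha="center", va="center")
fig.colorbar(im, ax=ax)
```

If seaborn is available, `seaborn.heatmap(counts, annot=True)` is a shorter route to a similar figure. The cluster numbers are arbitrary, so a good match shows up as one dominant cell per row rather than as a diagonal.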

Examine the original features

Reshape the wide format into a long format.

Summarize each variable within each cluster.
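The reshape and the per-cluster summary can be sketched together. The example assumes the `load_wine` data as a DataFrame and a 3-cluster Ward solution. The names `hclust`, `variable`, and `value` are illustrative. The clustering uses the standardized features, while the summary is computed on the original, unstandardized features.

```python
import pandas as pd
from scipy.cluster import hierarchy
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled wine data stands in for the notebook's file.
features = load_wine(as_frame=True).data  # wide: one column per original feature

# Cluster on the standardized features.
X_std = StandardScaler().fit_transform(features)
Z = hierarchy.linkage(X_std, method="ward")

# Attach the cluster labels to the ORIGINAL features.
wide = features.copy()
wide["hclust"] = hierarchy.cut_tree(Z, n_clusters=3).ravel()

# Wide to long: one row per (wine, variable) pair, keeping the cluster label.
long_df = wide.melt(id_vars="hclust", var_name="variable", value_name="value")

# Summary statistics for each variable within each cluster.
summary = (
    long_df.groupby(["hclust", "variable"])["value"]
    .agg(["count", "mean", "std", "min", "max"])
)
print(summary.head(13))
```

The long format also suits faceted plots, for example one box plot panel per variable with one box per cluster.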